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Abstract 

We consider the problem of learning the structure of undirected graphical models with 
bounded treewidth, within the maximum likelihood framework. This is an NP-hard problem 
and most approaches consider local search techniques. In this paper, we pose it as a combi- 
natorial optimization problem, which is then relaxed to a convex optimization problem that 
involves searching over the forest and hyperforest polytopes with special structures, indepen- 
dently. A supergradient method is used to solve the dual problem, with a run-time complexity 
of 0(k 3 n k+2 logn) for each iteration, where n is the number of variables and k is a bound on 
the treewidth. We compare our approach to state-of-the-art methods on synthetic datasets and 
classical benchmarks, showing the gains of the novel convex approach. 



1 Introduction 

Graphical models provide a versatile set of tools for probabilistic modeling of large collections of 
interdependent variables. They are defined by graphs that encode the conditional independences 
among the random variables, together with potential functions or conditional probability distribu- 
tions that encode the specific local interactions leading to globally well-defined probability distribu- 
tions H |31 [TS]. 

In many domains such as computer vision, natural language processing or bioinformatics, the struc- 
ture of the graph follows naturally from the constraints of the problem at hand. In other situations, 
it might be desirable to estimate this structure from a set of observations. It allows (a) a statisti- 
cal fit of rich probability distributions that can be considered for further use, and (b) discovery of 
structural relationship between different variables. In the former case, distributions with tractable 
inference are often desirable, i.e., inference with run-time complexity does not scale exponentially 
in the number of variables in the model. The simplest constraint to ensure tractability is to impose 
tree-structured graphs [7]. However, these distributions are not rich enough, and following earlier 
work [2T] [TJ [51 [SJ [T3J [H] , we consider models with bounded treewidth, not simply by one (i.e., trees), 
but by a small constant k. 

Beyond the possibility of fitting tractable distributions (for which probabilistic inference has linear 
complexity in the number of variables), learning bounded-trccwidth graphical models is a key to 
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design approximate inference algorithms for graphs with higher treewidth. Indeed, as shown by |241 
1321117] . approximating general distributions by tractable distributions is a common tool in variational 
inference. However, in practice, the complexity of variational distributions is often limited to trees 
(i.e., k = 1), since these are the only ones with exact polynomial-time structure learning algorithms. 
The convex relaxation designed in this paper enables us to augment the applicability of variational 
inference, by allowing a finer trade-off between run-time complexity and approximation quality. 

Apart from trees, learning the structure of a directed or undirected graphical model, with or with- 
out constraints on the treewidth, remains a hard problem. Two types of algorithms have emerged, 
based on the two equivalent definitions of graphical models: (a) by testing conditional independence 
relationships [27] or (b) by maximizing the log-likelihood of the data using the factorized form of 
the distribution [TT]. In the specific context of learning bounded-treewidth graphical models, the 
latter approach has been shown to be NP-hard [55] and led to various approximate algorithms based 
on local search techniques [3T] [5] [T5J [T] [5S] [5S] while the former approach led to algorithms based 
on independence tests [221 [5] 113] , which have recovery guarantees when the data-generating distri- 
bution has low treewidth. Malvestuto [21] proposed a greedy heuristic of hyperedge selection with 
least incremental entropy. Deshpande et al. [9] proposed a simple edge selection technique that 
maintains decomposability of the graph while minimizing the KL-divcrgcnce to the original distri- 
bution. Karger et al. |15| proposed the first convex optimization approach to learn the maximum 
weighted fc-windmill, a sub-class of the decomposable graph. Bach et al. [T| gave an approach which 
iteratively refines the hyperedge selection based on KL-divergence using iterative scaling. Shahaf et 
al. |26j proposed another convex optimization approach with Bethe approximation of the likelihood 
using graph-cuts. Szantai et al. [21] proposed a hyperedge selection criteria based on high mutual 
information within a hyperedge. Narasimhan et al. |22| performs independence tests by solving 
submodular optimization problems and derives a decomposable graph using dynamic programming. 
Chcchetka et al. [B] used the weaker notion of conditional mutual information instead of conditional 
independence to learn approximate junction trees. Gogate et al. [13] uses low mutual information 
criteria to recursively split the state space to smaller subsets until no further splits are possible. 

In this paper, we make the following contributions: 

• We provide a novel convex relaxation for learning bounded-treewidth decomposable graphical 
models from data in polynomial time. This is achieved by posing the problem as a combina- 
torial optimization problem in Section [2] which is relaxed to a convex optimization problem 
that involves the graphic and hypergraphic matroids, as shown in Section [3] 

• We show in Section [4] how a supergradient ascent method may be used to solve the dual 
optimization problem, using greedy algorithms as inner loops on the two matroids. Each 
iteration has a run-time complexity of 0(k 3 n k+2 logn), where n is the number of variables. 
We also show how to round the obtained fractional solution. 

• We compare our approach to state-of-the-art methods on synthetic datasets and classical 
benchmarks in Section [5] showing the gains of the novel convex approach. 



2 Maximum Likelihood Decomposable Graphical Models 

In this section, we first review the relevant concepts of decomposable graphs and junction trees; for 
more details, see [H |321 US] ■ We then cast the problem of learning the maximum likelihood bounded 
treewidth graph as a combinatorial optimization problem. 
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Figure 1: (a) A decomposable graph on the set of vertices V = {1, 2, 3, 4, 5, 6, 7, 8, 9} having treewidth 
2.(b) A junction tree embedded on the decomposable graph representing the maximal cliques by blue 
dots and the separator sets by blue lines, (c) The corresponding junction tree representation of the 
decomposable graph with ovals representing the maximal cliques and the rectangles representing the 
corresponding separator set. 

2.1 Decomposable graphs and junction trees 

We assume we are given an undirected graph G defined on the set of vertices V = {1,2, ... ,n}. 
Let C(G) denote the set of maximal cliques of G (which we will refer to as cliques). We consider 
n random variables X\, . . . ,X n (referred to as X), associated with each vertex indexed by V. For 
simplicity, they are assumed to be discrete, but this is not a restriction as maximum likelihood will 
use only entropies that can be extended to diffcrcntiablc entropies [8]. 

The distribution p(x) of X is said to factorize in the graph G, if and only if it factorizes as a product 
of potentials that depend only on the variables within maximal cliques. 

A graph is said to be decomposable if it has a junction tree, i.e., a spanning tree whose vertices are 
maximal cliques of G (i.e., C(G) is the vertex set) such that: 

• the junction tree connects only cliques that have a common element (clique tree property), 

• for any vertex i € V, the subgraph of cliques containing i is a tree (running intersection 
property). 



Let T(G) denote the edges of the junction tree over the set of cliques C(G). When the graph G is 
decomposable, the distribution p(x) of X factorizes in G if and only if it may be written as 

Pg(x) = — ^ , (1) 

ll(C,D)eT(G) PcnD(xcnD) 

where x is an instance in the domain of X , which we denote by X . pc(xc) denotes the marginal 
distribution of random variables belonging to C 6 C(G) and PcnD(xcnD) denotes the marginal 
distribution of random variables belonging to the separator set CDD, such that (C, D) 6 T(G). See 
Figure [T] The treewidth of G is the maximal size of the cliques in G, minus one. 

An alternative representation of decomposable graphs may be obtained by considering hyper graphs. 
Hypergraphs are defined by a base set V and a set of hyperedges, i.e., subsets of V. A hypergraph 
is said to be acyclic if and only if the resulting graph obtained by connecting all elements within an 
hyperedge is decomposable. Unfortunately, the nice properties of acyclic graphs do not transfer to 
acyclic hypergraphs. Particulary, the matroid property, which allows exact greedy algorithms, does 
not hold. In Section l3~2l we will use a lesser known more general notion of acyclicity that will lead 
to exact greedy algorithms. 
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2.2 Maximum likelihood estimation 



Given N observations x 1 , . . . , x N of X , we denote the corresponding empirical distribution of X by 
p(x) = j? <5(a; = a; 1 ). Given the structure of a decomposable graph G, the maximum likelihood 
distribution that factorizes in G may be obtained by combining the marginal empirical distributions 
on all maximum cliques and their separators. 

Let p{x) denote the empirical distribution and pg{ x ) denotes the projected distribution on a decom- 
posable graph G. Estimating the maximum likelihood decomposable graph which best approximates 
p is equivalent to finding the graph, G, which minimizes the KL-divergence between the target dis- 
tribution and the projected distribution, pQ, defined by D{p\\po). 

p(x) 



D{p\\p G ) = J^p(a;)log 



xex ' 

— p(x) \ogpa{x) as p(x) is independent of G 

xex 

^ Ucec(G)Pc(xc) 

= > -p(x) log — from Eq. yj) 

^x l\.(c,D)eT(G)PcnD(x C nD) 

= ( -P( x ) lo & II Pc(xc) J - Y ( -P( x ) l °& II PcnD(xcnD) 

xex \ cec(G) ' xex ^ (c.D)eT(G) 

= Y -P( x ) lo SPc(xc) - -p( x ) lo SPcnD(xcnD) 
ceC(G)xex (c,D)eT(G)xex 

= Y Y -Pc(x c )\ogPc(x c ) - Y Y -PcnD(xcnD)\ogp C nD(xcnD) 

CeC(G)x c eX c (C,D)eT(G)x C noeX C nn 

= J2 H ^)- E H(cnD), (2) 

CeC(G) {C,D)eT(G) 

where H{S) is the empirical entropy of the random variables indexed by the set S C V, defined by 
H(S) = J2 xs {~P s ( Xs ^ ^°gPs( x s)}, an d where the sum is taken over all possible values of xs- 

Note that in this paper, we will not be using a traditional model selection term as we will only 
consider models of low tree- width (with a bounded number of parameters) . 



2.3 Combinatorial optimization problem 

We now consider the problem of learning a decomposable graph of treewidth less than k. We assume 
that we are given all entropies H(S) for subsets S of V of cardinality less than k + 1. 

Since we do not add any model selection term, without loss of generality [3D] , we restrict the search 
space to the space of maximal junction trees, i.e., junction trees with n — k maximal cliques of size 
k + 1 and n — k — 1 separator sets of size k between two cliques of size k + 1. Our natural search spaces 
are thus characterized by T>, the set of all subsets of size k + 1 of V, of cardinality (j.?]} , and £ , the 
set of all potential edges in a junction tree, i.e., £ = {(C, D) e V x T> 7 C n D ^ 0, \C n D\ = k}. 
The cardinality of £ is ( fe ™ 2 ) -( k ^ 2 ) (number of subsets of size k + 2 times the number of possibility 
of excluding two elements to obtain a separator). 

A decomposable graph will be represented by a clique selection function r : T> — > {0, 1} and an 
edge selection function p : £ — > {0, 1} so that r(C) = 1 if C is a maximal clique of the graph and 
p(C, D) = 1 if (C, D) is an edge in the junction tree. Both p and r will be referred to as incidence 
functions or incidence vectors, when seen as elements of {0, 1} V and {0, 1} £ . 
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Thus, minimizing the problem defined in Eq. is equivalent to minimizing, 

P(t, P )=J2 H{C)t{C) - H (C n D)p(C, D), (3) 
cev (c,D)e£ 

with the constraint that (t, p) forms a decomposable graph. 

At this time, we have merely reparametcrized the problem with the clique and edge selection func- 
tions. We now consider a set of necessary and sufficient conditions for the pair to form a decom- 
posable graph. Some are convex in (r, p), while some are not. The latter ones will be relaxed in 
Section [3] From now on, we denote by l^gc the indicator function for i G C (i.e., it is equal to 1 if 
i E C and zero otherwise). 

• Covering V: Each vertex in V must be covered by atleast one of the selected cliques, 

V« G V, £ h eC r(C) > 1. (4) 

• Number of edges: Exactly n — k — 1 edges from £ must be selected, 

J2 p(C,D)=n-k-l. (5) 

(C,D)e£ 

• Number of cliques: Exactly n — k cliques from T> must be selected, 

£ r(C) =n-k. (6) 
cev 

• Running intersection property: Every vertex, i 6 V must induce a tree, i.e., the number 
of selected edges containing the vertex, i, must be equal to the number of selected cliques 
containing the vertex, i , minus one. 

V* G V, 1 ie(anD)P(C, D) - ]T h eC r(C) + 1 = 0. (7) 

(c,D)e£ cev 

• Edges between selected cliques: An edge in £ is selected by p only if the cliques it is incident 
on is selected by r. 

VCeP, t(C)= max p(C,D). (8) 
Dev, (c.D)es 

• Acyclicity of p: p selects edges in £ such that they do not have loops, e.g., the blue lines in 
Figure ffl-(b) cannot form loops, 

p represents a subforest of the graph (T>,£). (9) 

• Acyclicity of r: r selects the hyperedges of V in T> such that they are acyclic, i.e., 

t represents an acyclic hypergraph of (V,T>). (10) 



The above constraints encode the classical definition of junction trees. Thus our combinatorial 
problem is exactly equivalent to minimizing P(r,p) defined in Eq. ([3]), subject to the constraints in 
in Eq. (gj), Eq. ©, Eq. ©, Eq. 0, Eq. ©, Eq. © and Eq. (TO). Note that the constraint Eq. ([TO]) 
that t represents an acyclic hypergraph is implied by the other constraints. 

Figure [2] shows clique and edge selections in blue which satisfy all these constraints and hence 
represent a decomposable graph. The clique and edge selections in red violates at least one of these 
constraints. 
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Figure 2: Space of cliques V denoted by ovals and the space of feasible edges £ denoted by lines for 
V = {1,2,3,4,5} and treewidth 2(in Black). Clique and edge selections in blue represent decom- 
posable graphs while those in red denote graphs that are not decomposable (best seen in color). 

3 Convex Relaxation 

We now provide a convex relaxation of the combinatorial problem defined in Section 12.31 The 
covering constraint in Eq. (|4]), the number of edges and the number of cliques constraints in Eq. ([5]) 
and Eq. ([6]) respectively, and the running intersection property in Eq. are already convex in (r, p). 

The constraint in Eq. (J8) that VC G T>, t(C) = max De -p, (c,D)££ p(C,D) may be relaxed into: 

• Edge constraint: selection of edges only if the both the incident cliques are selected, i.e., 

VC e V, V(C, D) e £, p(C, D) < r(C). (11) 

• Clique constraint: selection of a clique if at least an edge incident on it is selected, i.e., 

VCeP, r(C) < ]T P (C,D). (12) 

We now consider the two acyclicity constraints in Eq. © and Eq. (fTUj) . 
3.1 Forest polytope 

Given the graph (I?, £), the forest polytope is the convex hull of all incidence vectors p of subforests 
of (T>, £). Thus, it is exactly the convex hull of all p : £ — >• {0, 1} such that p satisfies the constraint 
in Eq. ©. We may thus relax it into: 

- Tree constraint: 

p is in the forest polytope of {T>,£). (13) 



G 



While the new constraint in Eq. (| 13[) forms a convex constraint, it is crucial that it may be dealt 
with empirically in polynomial time. This is made possible by the fact that one may maximize 
any linear function over that polytope. Indeed, for a weight function w : £ x £ — > R, maximizing 
S(c D)e£ D)p(C, D) is exactly a maximum weight spanning forest problem, and its solution 
may be obtained by Kruskal's algorithm, i.e., (a) order all (potentially negative) weights w(C,D) 
and (b) greedily select edges (C,D), i.e., set p(C,D) = 1, with higher weights first, as long as they 
form a forest and as long as the weights are positive. When we add the restriction that the number 
of edges is fixed (in our case n — k — 1), then the algorithm is stopped when exactly the desired 
number of edges is selected (whether the corresponding weights arc positive or not). See, e.g., (25) . 

The polytope defined above may also be defined as the independence polytope of the graphic matroid, 
which is the traditional reason why the greedy algorithm is exact [25]. In the next section, we show 
how this can be extended to hypergraphs. 

3.2 Hypergraphic matroid 

Given the set of potential cliques V over V, we consider functions r : T> — > {0,1} that are equal 
to one when a clique is selected, and zero otherwise. Ideally, we would like to treat the acyclicity 
of the associated hypergraph in a similar way than for regular graphs. However, the set of acyclic 
subgraphs of the hypergraph defined from T> does not form a matroid, and thus the polytope defined 
as the convex hull of all incidence vectors/functions of acyclic hypergraphs may be defined, but the 
greedy algorithm is not applicable. In order to define what is referred to as the hypergraphic matroid, 
one needs to relax the notion of acyclicity. 

We now follow HQl H2 and define a different notion of acyclicity for hypergraphs. An hypergraph 
(V, J 7 ) is an hyperforest if and only if for all A C V, the number of hyperedges in J- contained in 
A is less than \A\ — 1. A non-trivially equivalent definition is that we can select two elements in 
each hypcredege so that the graph with vertex set V and with edge set composed of these pairs is a 
forest. 

Given an hypergraph with hyperedge set X>, the set of sub-hypergraphs which are hypcrforests 
forms a matroid. This implies that given a weight function on T>, one may find the maximum weight 
hyperforest with a greedy algorithm that ranks all hyperedges and select them as long as they don't 
violate acyclicity (with the notion of acyclicity just defined and for which we exhibit a test below). 

Testing acyclicity. Checking acyclicity of an hypergraph (V, F) (which is needed for the greedy 
algorithm above) may be done by minimizing with respect to A C V 

\A\ !gca. 

The hypergraph is an hyperforest if and only if the minimum is greater or equal to one. The 
minimization of this function may be cast a min-cut / max- flow problem as follows |12j : 

- single source, single sink, one node per hyperedge in J 7 , one node per vertex in V, 

- the source points towards each hyperedge with unit capacity, 

- each hyperedge points towards the vertices it contains, with infinite capacity, 

- each vertex points towards the sink, with unit capacity. 

Link with decomposability. The hypergraph obtained from the maximal cliques of a decom- 
posable graph can easily be seen to be an hyperforest. But the converse is not true. 
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Hyperforest polytope. We can now naturally define the hyperforest polytope as the convex hull 
of all incidence vectors of hyperforests. Thus the constraint in Eq. (| 10[) may be relaxed into: 



Hyperforest constraint: 



t is in the hyperforest polytope of (V,V). 



(14) 



3.3 Relaxed optimization problem 

We can now formulate our combinatorial problem from the constraints in Eq. Eq. Eq. ([B]), 
Eq. 0, Eq. (HI1), Eq. dHJ), Eq. (HU) and Eq. (O as follows 

re{Q,l} v , 
pG {0,1} £ , 



va.Ya.V {t , p) subject to 



V* e K £ CeP l ieC r(C) > 1, 
Y,(c,D)e£P( C ' D ) = n-k-. 



(C,D)e£ 



he(cnD)P(C, D) - £ CeP l ieC r(C) + 1 = 0, 



(15) 



VC G V, V(C, D) G £, p(C, D) < r(C), 
VC &V,t(C) <J2 (C!D)££ p(C,D), 

p is in the forest polytope of (£>,£), 
t is in the hyperforest polytope of (V,T>). 

All constraints except the integrality constraints are convex. Let r-relaxed primal be the partially 
relaxed primal optimization problem formed by relaxing only the integral constraint on r in Eq. (|15|) , 
i.e., replacing r G {0, 1} V by r G [0, l] v . Note that this is not a convex problem due to the remaining 
integral constraint on p, but it remains equivalent to the original problem as the following proposition 
shows. 

73)) and the T-relaxed primal problem are equiv- 



Proposition 1 The combinatorial problem in Eq. 
alent. 

Proof Let us assume (r*,p*) be a feasible solution for the relaxed primal with < t*(C) < 1 for 
some C G T>. The edge constraint in Eq. (fTTj) ensures that there arc no incident edges on C selected 
by p* (as p* is integral). This violates the clique constraint in Eq. (fT2|) . Therefore, the feasible 
solutions of relaxed primal are integral. Hence the optimal solutions of the primal and the relaxed 
primal are identical. ■ 



The convex relaxation for the primal optimization problem formed by relaxing the integral constraint 
on both r and p can now be defined as 

re[0,i] c 

PG [0,1] £ , 



min "P(t, p) subject to < 



VieF.EcG^liecrKC) > 1, 
E(c,r>)££ D) = n-k-l, 
J2cev T ( c ) = n-k, 

Vi e v > E(c D)es l it(cnD)P{C, D) - J2cev U&ct{C) 

VC G V, V(C, D) G £, p(C, D) < r(C), 

VC eV,T(C) <J2 (c<D)eS p(C,D), 

p is in the forest polytope of (2?,£), 

r is in the hyperforest polytope of (V,T>). 



1 = 0, 



(16) 
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4 Solving the dual problem 



We now show how the convex problem may be minimized in polynomial time. Among the constraints 
of our convex problem in Eq. (|15p. some are simple linear constraints, some are complex constraints 
depending on the forest and hyperforest polytopes defined in Section [31 We will define a dual 
optimization problem by introducing the least possible number of Lagrange multipliers (a.k.a. dual 
variables) [3] so that the dual function (and a supcrgradient) may be computed and maximized 
efficiently. We introduce the following dual variables: 

- Set cover constraints in Eq. (|4]): 7 £ R+. 

- Running intersection property in Eq. ([7]): fi £ M. v . 

- Edge constraints in Eq. ((TTJ): A G R+ £ . 

- Clique constraints in Eq. (TT2)) : 77 € R+. 

Therefore, the dual variables are (7, (i, A, rj). Let C(t, p, 7, /j, A, rj) be the Lagrangian relating the 
primal and dual variables. It is derived from the primal cost function defined in Eq. © along with 
the covering constraint, running intersection property, the edge and the clique constraints defined 
in Eq. ([4]), Eq. ([7]), Eq. (fTTj) and Eq. ([T2j) respectively. The Lagrangian can be computed from the 
dual variables (7, fx, A, 77) as follows: 

= £if(C)T(C)- ]T H(CnD)p(C,D) 
cev (C,D)es 

+ E T< f 1 - E 1 ^cr(C) N ) + J> ( E ^(CnujpCC, £>)- E 1 ^cr(C) + 1 
iev ^ cex> ' iev ^-(c,D)es cev 

+ E E A cc (p(C,^)-r(C)) + ^ ??c (r(C)- ]T p(C,Z>; 

= E - E(^ + T*) - E A ^ + »?c) r(C) 

cev ^ iec (c,D)e£ ' 

Y (^(C DD) — Y p 1 -\ C d-Xdc+Vc + Vd)p(C,D) + Y(^+J 1 ), 
(c.D)ee ^ ie(cnD) ' iev 



(17) 



with the following dual constraints on the Lagrange multipliers 



V i G V, 0, 
VCeP, V(C,£>)e£, A CI5 > 0, 

VC eV, ric> 0. (18) 

We can now derive a dual optimization problem with Q(j, fi, A, n) represent the dual cost function, 
which can be derived from the Lagrangian in Eq. (|17p. We use the the number of edges constraint, 
the number of cliques constraint, tree constraint and hyperforest constraint given by Eq. ([5]), Eq. 
Eq. (Tf"5]) and Eq. (TH)) respectively in deriving the dual as follows: 
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2(7, M, A, 77) 



inf H(C7)- XI(M<+7i)- E ^CD+Vc)r(C) 



Ecgd r(C)=n-k 
t(E hyperforest polytopc of (V,T> 



sup ^ Uf(CnD)- IM-^cd-Xdc + Vc + Vd)p(C,D) 



£(C,r>)e« p(C,D)=n-fc-l 
p£forcst polytope of {T> .£) 



+ 52(jH+'Yi). (19) 

iev 

It is decomposed in three parts defined in Eq. (|2ip . Eq. (j2"2")) and Eq. (j2"Td)) respectively : 

2(7, M, A, 77) = Qi(l, A*, A, 77) + q 2 (y,n,\,T]) + 93(7, A, 77), (20) 

where 

gi (7,/i,A,r?)= inf Wtf(C) -J^* + 7< ) _ £ A C z> + ^cV (C$21) 

t£ hyperforest polytopc of (V,X>) 

92(7, M, A, 77) = - sup ^ (iJ(Cn£>)- jtii - A C d - A DC + ??c + Vd Jp(C,D). 

5-<(C,r>)6e P(C,-D)=n-fe-l 
forest polytope of {T> ,£) 

(22) 

93(7,/^, A, 77) = E(^+^)- ( 23 ) 

Therefore, the dual optimization problem using the dual cost function defined in Eq. (|19[) and the 
dual constraints defined in Eq. (fT5| is given by 

max 2(7, /i, A, 77) subject to < VC G £>, V(C, D) G £, A C d > 0, (24) 

[ VC G V, tic > 0. 

The dual functions 91(7, /J, A, 77) and 92(7, M, A, 77) may be computed using the greedy algorithms 
defined in Section [57X1 and Section 15721 q\ can be evaluated in 0(?Tog(r)), where r is the cardinality 
of the space of cliques, X>, i.e., ( fc ™ 1 ) and qi can be evaluated in 0(m log(m)), where m is the 
cardinality of feasible edges, £, i.e., (fc" 2 )'( fc ^ 2 )' This complexity is due to sorting the edges and 
hyperedges based on their weights. This leads to an overall complexity of 0(k 3 n k+2 logn) per 
iteration of the projected supergradient method which we now present. 



Projected supergradient ascent. The dual optimization problem defined by maximizing Q(-f, fi, A, 77) 
can be solved using the projected supergradient method. In each iteration t of the algorithm, the dual 
cost function, Q( 7 ', /-**, A*, 77*), is evaluated through estimation of q\ and 92 by solving Eq. (|2T1) and 
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Algorithm 1 Projected Supergradient 



Input: clique and edge entropies H, step-size constant a and number of iterations T 
Output: sequence of clique and edge selections over iterations (t*,^*) 
Initialize 7 = 0, //> = 0, A = 0, rp = 
for t = to T do 

solve Eq. (f21~|) and evaluate (71(7*, /^*, A*, 77*) to obtain r* 

solve Eq. (f22j) and evaluate 92(7*, M*> A*, 77*) to obtain 

update dual variables, (7* +1 , A* +1 , r;* +1 ) using supcrgradicnts and stepsize: a t 

+ 



^7 



7 



t+1 



7l + at 1-E 



,t+i 



xt+l _ 

a cd - 

vh +1 = 

end for 



"t ( E(c,_D)e£ 1 i€{CnD)P t (C, D) - J2cev hec^iC) + 1 



^CD 

rib 



»*(t*(C7)-p*(C,D 



Eq. d21|) respectively. In the process of solving these equations, the corresponding primal variables 
(t*,/3*) are also estimated and allows the computations of the supergradients of Q (i.e., opposites 
of subgradicnts of — Q) [3]. As shown in Algorithm [TJ a step is made toward the direction of the 
supergradient and projection onto the positive orthant is performed for dual variables that are con- 
strained to be nonnegative. With step sizes at proportional to 1/Vt, this algorithm is known to 
converge to a dual optimal solution [33J at rate l/s/i. Moreover, the average of all visited primal 
variables, i.e., after t steps, (ji,pt) = 7 EL=o( T "> P U ) IS known to be approximately primal-feasible 
(i.e., it satisfies all the linear constraints that were dualized up to a small constant that is also going 
to zero at rate l/^/t). The convergence to primal feasibility is illustrated in Figure 0Ja), where, on 
one of the synthetic examples from Section [5l the different constraint violations. Note that these 
are not the number of each of these constraints violated but the maximum value by which they are 
violated. It can be observed that the constraint violations reduce to zero over iterations. 

Proposition 2 If k = 1, the convex relaxation in Eq. il6}) is equivalent to Eq. \15\) . 

Proof If k = 1, all the cliques in the clique space contain only 2 vertices, i.e., VC G T>, \C\ = 2 and 
the number of elements in the feasible edges is only 1, i.e., V(C, D) S £, \C n D\ = 1. 

Solving the convex relaxation defined in Eq. (fTS]) is equivalent to solving the dual defined in Eq. (|24[) . 
On solving the dual variables, the optimal dual solution is given by 

VieV,m = H{{i}), 

ViGV,7i = 0, , , 

VC e V, V(C, D) e £,Xcd = 0, 1 > 

VCe2V ? c = 0, 

where H({i}) = —pi(xi) log(pi(ar,)). 



11 



The optimal solution to the dual problem is given by 



Q*( 7 ,/M,*?) = mf £ ^-E^W) T(C) + ^if(W) 

r(C )e [0lP ^ / ^ 

r€ hyperforest polytope of (V,X>) 

inf _J(C). r (C)+ £ #({*}), (26) 

r(c)e[o,i] 1 ' ^ 

Eceb r(C)=n-k 
r£ hyperforest polytope of (V,!)) 

where VC £ 2?, 1(C) = J2iec ^({^))~ H(C), which defines the mutual information of the elements in 
the clique, i.e., an edge if k = 1. The constraints in Eq. (|26|) define a spanning tree polytope [25] and 
the optimal solution is a maximal information spanning tree, which is given by Chow-Liu trees [7|. 
They also form the optimal solution to the non-convex primal optimization defined in Eq. f|15[) . ■ 



Algorithm 2 Approximate Greedy Primal Solution 

Input: primal infcasible sequence r* for Algorithm [1] treewidth k, number of Vertices n, set of 
cliques T> and integer m such that < m < T 

Output: approximate discrete primal feasible solution r m after m iterations of Algorithm [1] 
Initialize Adjacency Matrix Adj = zeros(n,n), f m = ^ X)t=o T * ant ^ T " 1 = zeros(length(T m )) 
order = Sorted indices in the descending order f m 
repeat 

Initialize decomposable = false, treewidth = 0, numC 'onnectedC 'omponents = 0, i = 1 
update TestAdj = AddClique(Adj, V(order(i))) 

update [decomposable, treewidth] = chcckGraphDecomposability(Test^4cf/) 
if decomposable = true and treewidth < fc then 

update Aij = TestAdj 

update T m (Order(i)) = 1 
end if 

[numC onnectedC omponents] = getNumberConnectedComponents(Tes£A<i/) 
update i = i + 1 

until decomposable = true, treewidth — k, numC onnectedC omponents = 1, i = lcngth(orefer) 



Approximate Greedy Primal Solution. We describe an algorithm to project from the average 
of a sequence of fractional primary infeasible solutions, estimated during the iterations of projec- 
tive supergradient, to an integral primary feasible solution. "AddCliquc" adds all the edges of a 
clique to the adjacency matrix. "chcckGrapIiDecomposability" checks if the maximal cardinality 
search is a perfect elimination ordering. For decomposable graphs the maximal cardinality search 
yields a perfect elimination ordering |14j . We refer to this as decompos ability test in this paper. 
"gctNumberConnectedComponents" gives the number of connected components in the graph using 
breadth- first search. Note that the projection only uses the average clique selection function, t m , to 
obtain the primary feasible solutions, r m . The corresponding edge selection, p m , can be estimated 
from clique selection, r m , by selecting the edges between consecutive cliques of the perfect sequence 
of selected cliques [T5]. The time complexity of the projection algorithm is 0(n k+2 ). This is due to 
decomposability test with run time complexity 0(n k+1 ), that is performed on adding 0(n) cliques. 
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5 Experiments and Results 



In this section, we show the performance of the proposed algorithm on synthetic datasets and classical 
benchmarks. 



Decomposable covariance matrices. In order to easily generate controllable distributions with 
entropies which are easy to compute, we use several decomposable graphs and we consider a Gaussian 
vector with covariance matrix E, generated as follows: 

• sample a matrix Z of dimensions n x d! with entries uniform in [0, 1] and consider the matrix 

E' = |ZZ T + (1 - (27) 

where Z is a random matrix of dimensions n x d! , I is the n-dimcnsional identity matrix and 
d is a parameter to determine the correlations between the nodes of the graph, which takes 
values in {0 7 d'}. In our experiments, we choose d! to be 128. We have tight correlations 
between the nodes with higher values of d. 

• normalize E' to unit diagonal, and 

• The normalized random positive definite covariance matrix, E', is projected onto a decompos- 
able graph G as follows: 



(£)-!= £ [(^n„- iVc^rx, (28) 

GeC(G) (C,.D)eT(G) 



where the operator [(E x ) _1 ] n gives annxn matrix whose columns and rows representing the 
set X C V arc filled by (E x ) _1 and the rest of the elements of the matrix are filled with zeroes. 
The matrix, E, thus generated represents the covariance matrix of a multivariate Gaussian on 
a decomposable graph, G. 

The projection ensures the following relationship between the random positive definite matrix, E' 
and the projected covariance matrix S: 

E(i,j) = E'(i > i)if^(i,j) = lori = j > 

= 0HA(i,j) = 0. (29) 

where A is the adjacency matrix of the decomposable graph G onto which E' was projected. 

The entropy of a multivariate Gaussian with a covariance matrix, E, is given by ^ log(27re)"|E|, 
where |E| denotes the determinant of the covariance matrix. However, for Gaussian distribution 
that is factored in G € Q : 

\n - n ncgc(G)l |i cl , , (30) 

ll((C,D)€T(G) \^CnD\ 

where Ex is the sub-matrix of the covariance matrix whose rows and columns belong to the set 
X QV. Therefore, for any multivariate decomposable Gaussian graphical model, G: 

TUG) = ibg((27rer|E|) 

= \{ log((27re)»|E |) - £ log((2 7 re)"|Ecni 3 |)) 

cec(G) (C,z>)eT(G) 

= E H (°)- E H(CHD). (31) 

CGC(G) (C,D)eT(G) 
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Note that the entropy of any graph, G, is independent of the mean of the normal distribution, hence 
we consider only the covariance matrix. 

We use the graph structures representing a chain junction tree as in Figurc[3]-(a) an d a star junction 
tree as in Figure [3]-(b) to analyze the performance of our algorithm for decomposable covariance 
matrices generated with different correlations. 

Table Q] and Table [2] show the performance of our algorithm on these two graphs. Decomposable 
covariance matrices are generated as above with different values of the correlation parameter d (all 
averaged over ten different random covariance matrices). We show the difference between the cost 
function in Eq. (|3|) and the optimal entropy, i.e., the one of the actual structure represented by the 
covariance matrices. The differences in the table are multiplied by 10 3 for brevity. 

The first column ADual represents the optimal value of our convex relaxation (obtained from the 
dual function), while the second column ADuaF represents the optimal value by replacing the 
hypcrforest constraint by the simply r £ [0, l] v . We can see from the two tables, that the two values 
are strictly negative (i.e., we indeed have a relaxation) and that the hypcrforest constraint is key to 
obtaining tighter relaxations. Note that the associated solutions are only fractional. 

The third column APrimal represents the cost function obtained by projection of the optimal frac- 
tional solution of the hypcrforest constraint, using Approximate Greedy Primal Solution algorithm. 
The fourth column APrimaF represents the cost function obtained by projecting the optimal frac- 
tional solution of the hypercube constraint, i.e., the corresponding primal feasible solution related to 
ADuaF . They are compared to a simple greedy algorithm in the fifth column that sorts all mutual 
information and keep adding the cliques with largest mutual information as long as decomposability 
is maintained. Although the relaxation is not tight, our rounding scheme leads empirically to the 
optimal solution when the correlations are strong enough (i.e., large values of d) and outperform the 
simple greedy algorithm. 




(a) (b) 

Figure 3: Graph representing (a) chain junction tree, (b) star junction tree, with an embedded 
junction tree in green and its junction tree representation in blue. 

Performance Comparison. We compare the quality of the graph structures learned by the 
proposed algorithm with the ones produced by Ordering Based Search (OBS) [3T], the combina- 
torial optimization algorithm proposed by Karger and Srebro (Kargcr+Srebro) [15] . the Chow-Liu 
trees (Chow-Liu) [7] and different variations of PAC-lcarning based algorithms (PAC-JT, PAC- 
JT+local) [6j. We use a real- world dataset, TRAFFIC [18] and an artificial dataset, ALARM [2] to 
compare the performances of these algorithms. 
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Figure 4: Left: (a) Upper bound of constraint violations for d=2 and a chain junction tree. Right: 
Log likelihood of the structures learnt using various algorithms on (b) TRAFFIC and (c) ALARM 
datasets with k = 3 except Chow-Liu (k = 1). 



Table 1: Performance on chain junction trees. See text for details. 



d 


ADual 


ADuaF 


APrimal 


APrimaF 


AGreedy 


1 


-0.7±0.1 


-32.7±16.4 


0.2±0.1 


0.4±0.1 


0.2±0.1 


2 


-0.4±0.1 


-23.4±9.6 





0.3±0.2 


0.5±0.2 


4 


-l.liO.l 


-31.2±9.7 





0.3±0.1 


1.9±0.3 


8 


-0.6±0.1 


-23.9±9.8 





0.2±0.1 


7.9±0.3 


16 


-1.9±0.2 


- 3.4±2.7 








25.6±1.2 


32 


-3.9±0.5 


- 3.2±0.3 








57.3±1.5 



Table 2: Performance on star junction trees. See text for details. 



d 


ADual 


ADuaF 


APrimal 


APrimaF 


AGreedy 


1 


-0.8±0.1 


-31.4±13.4 


0.2±0.1 


0.5±0.1 


0.9±0.1 


2 


-0.5±0.2 


-26.6±13.3 





0.4±0.1 


0.4±0.3 


4 


-0.3±0.0 


-16.6±4.1 





0.2±0.1 


1.7±0.2 


8 


-0.4±0.0 


-16.0±9.6 








6.9±0.3 


16 


-1.2±0.5 


-3.1±0.3 








26.3±1.5 


32 


-6.8±0.4 


-8.5±1.2 








58.3±1.9 



This ALARM dataset was sampled from a known Bayesian network [2] of 37 nodes, which has a 
treewidth equal to 4. We learn an approximate decomposable graph of treewidth 3. The TRAFFIC 
dataset is the traffic flow information every 5 minutes for a month at 8000 locations in California [18] . 
The traffic flow information is collected at 32 locations in San Francisco Bay area and the values 
are discrctized into 4 bins. We learn an approximate decomposable graph of treewidth 3 using 
our approach. Empirical entropies are computed from the generated samples of each data set and 
we infer the underlying structure from them using our algorithm. Figure BJb) and Figure |4|c) 
show the log-likelihoods of structures learnt using various algorithms on Traffic and Alarm datasets 
respectively. Note that the performance is better with higher values as we compare log-likelihoods. 
These figures illustrate the gains of the convex approach over the earlier non-convex approaches. 
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6 Conclusion and Future Work 



In this paper, we have provided a convex relaxation to the problem of finding the maximum likelihood 
decomposable graph with bounded treewidth, with a polynomial-time optimization algorithm, which 
empirically outperforms previously proposed algorithms. We are currently exploring two avenues for 
improvements: (a) design sufficient conditions for tightness of our relaxation, following the recent 
literature on relaxation of variable selection problems [5], and (b) use heuristics to speed-up the 
algorithms to allow application to larger graphs. 
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